Linear Regression - Interpreting the result

In this notebook we use linear regression to predict the coefficients corresponding to the top eigenvectors of the measurements:

TAVG: The average temperature for each day and location, (TMAX + TMIN)/2.
TRANGE: The difference between the highest and lowest temperatures of the day, TMAX - TMIN.
SNWD: The depth of the accumulated snow.

For each measurement we keep the coefficients of the top 3 eigenvectors, which gives the 9 output variables that we aim to predict.

The 4 input variables we use for the regression are properties of the location of the station:

latitude, longitude: location of the station.
elevation: the elevation of the location above sea level.
dist_coast: the distance of the station from the coast (in kilometers).

Read and parse the data



In [65]:

    
import pickle
import pandas as pd
!ls *.pickle  # check









    



stations_projections.pickle



In [66]:

    
!curl -o "stations_projections.pickle" "http://mas-dse-open.s3.amazonaws.com/Weather/stations_projections.pickle"









    



  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2750k  100 2750k    0     0  1100k      0  0:00:02  0:00:02 --:--:-- 1100k



In [67]:

    
# Open the pickle in binary mode and close the file when done
with open("stations_projections.pickle",'rb') as f:
    data = pickle.load(f)
data.shape









    Out[67]:





(12140, 8)



In [68]:

    
data.head(1)









    Out[68]:






  
    
      
       station                                     TAVG_coeff                                     TRANGE_coeff                                         SNWD_coeff  latitude  longitude  elevation  dist_coast
0  USC00044534  [3047.96236332, 1974.34852034, 150.560792408]  [-2903.63287861, -236.907267527, 147.021790682]  [0.19150300062, 0.187262808215, -0.0401379552536]   36.0042    -119.96       73.2     107.655



In [69]:

    
# Break up each list of coefficients into separate columns (TAVG_coeff1, TAVG_coeff2, ...)
for col in [u'TAVG_coeff', u'TRANGE_coeff', u'SNWD_coeff']:
    for i in range(3):
        new_col=col+str(i+1)
        data[new_col]=[e[i] for e in list(data[col])]
    data.drop(labels=col,axis=1,inplace=True)
# The station ID is not used as a regression input
data.drop(labels='station',axis=1,inplace=True)
print(data.columns)
data.head(3)









    



Index([     u'latitude',     u'longitude',     u'elevation',    u'dist_coast',
         u'TAVG_coeff1',   u'TAVG_coeff2',   u'TAVG_coeff3', u'TRANGE_coeff1',
       u'TRANGE_coeff2', u'TRANGE_coeff3',   u'SNWD_coeff1',   u'SNWD_coeff2',
         u'SNWD_coeff3'],
      dtype='object')






    Out[69]:






  
    
      
   latitude  longitude  elevation  dist_coast  TAVG_coeff1  TAVG_coeff2  TAVG_coeff3  TRANGE_coeff1  TRANGE_coeff2  TRANGE_coeff3  SNWD_coeff1  SNWD_coeff2  SNWD_coeff3
0   36.0042  -119.9600       73.2   107.65500  3047.962363  1974.348520   150.560792   -2903.632879    -236.907268     147.021791     0.191503     0.187263    -0.040138
1   42.7519  -124.5011       12.8     0.61097  2072.149003   880.454659   -19.403966   -1588.344065      22.091593      53.905710     0.315438     0.126292     0.792079
2   47.1064  -104.7183      632.8  1316.54000   949.764151  2361.836952   132.430209   -2802.638187    -165.774139     152.216161   745.947252   256.091735   113.675894

Performing and evaluating the regression

As the size of the data is modest, we can perform the regression using regular Python (not Spark) running on a laptop. We use the scikit-learn library (sklearn).



In [70]:

    
from sklearn.linear_model import LinearRegression

Coefficient of determination

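Computed by calling the method LinearRegression.score().

The regression score goes by several names: "coefficient of determination", $R^2$, "R squared score", and "fraction of variance explained". It is closely related to, but not the same as, the correlation coefficient: on the training data, $R^2$ equals the square of the correlation between $y$ and the predicted values. It is explained in more detail in wikipedia.

Roughly speaking, the $R^2$-score measures the fraction of the variance of the regression output variable that is explained by the prediction function: $R^2 = 1 - \sum_i (y_i - \hat{y}_i)^2 / \sum_i (y_i - \bar{y})^2$. On the training set, the score of a least-squares fit with an intercept lies between 0 and 1. A score of 1 means that the regression function perfectly predicts the value of $y$. A score of 0 means that it predicts $y$ no better than the constant $\bar{y}$. On a test set the score can be negative, which means that the predictions are worse than that constant.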
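As a minimal sketch of this definition (not scikit-learn's implementation), the score can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The function name r2_score and the toy values below are assumptions chosen for illustration only.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Toy values, made up for illustration
y = [1.0, 2.0, 3.0, 4.0]
perfect = r2_score(y, y)                       # predictions equal to y
baseline = r2_score(y, [2.5, 2.5, 2.5, 2.5])   # always predict the mean of y
worse = r2_score(y, [4.0, 3.0, 2.0, 1.0])      # predictions in reversed order
print(perfect, baseline, worse)                # prints: 1.0 0.0 -3.0
```

The last case shows that predictions that are worse than the constant mean give a score below 0.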

Training score vs Test score

Suppose we fit a regression function with 10 features to 10 data points. We are very likely to fit the data perfectly and get a score of 1. However, this does not mean that our model truly explains the data. It just means that the number of training examples we are using to fit the model is too small. To detect this situation, we can compute the score of the model that was fit to the training set, on a test set. If the ratio between the test score and the training score is smaller than, say, 0.1, then our regression function probably over-fits the data.
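The scenario above can be sketched with synthetic data. This is an illustration under stated assumptions, not part of the weather analysis: the inputs and outputs are independent random noise, the fit uses numpy.linalg.lstsq instead of sklearn, and the helper names r2 and add_intercept are hypothetical.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def add_intercept(X):
    """Append a column of ones so that the fit includes an intercept."""
    return np.hstack([X, np.ones((X.shape[0], 1))])

rng = np.random.RandomState(0)
n_train, n_test, n_features = 10, 1000, 10

# Inputs and outputs are independent noise: there is nothing real to learn
X_train = rng.randn(n_train, n_features)
y_train = rng.randn(n_train)
X_test = rng.randn(n_test, n_features)
y_test = rng.randn(n_test)

# Least-squares fit of 10 features (plus intercept) to 10 points
w = np.linalg.lstsq(add_intercept(X_train), y_train, rcond=None)[0]
train_r2 = r2(y_train, add_intercept(X_train).dot(w))
test_r2 = r2(y_test, add_intercept(X_test).dot(w))
print("train R^2: %.3f, test R^2: %.3f" % (train_r2, test_r2))
```

The training score should be essentially 1 even though the data is pure noise, while the test score should be far lower. Only the comparison between the two scores reveals the over-fitting.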

Finding the importance of input variables

The fact that a regression coefficient is far from zero provides some indication that it is important. However, the size of these coefficients also depends on the scaling of the variables. A much more reliable way to find out which of the input variables are important is to compare the score of the regression function we get when using all of the input variables to the score when one of the variables is eliminated. This is sometimes called "sensitivity analysis".
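The effect of scaling can be illustrated with a small synthetic sketch. The data, the helper fit_and_score, and the use of numpy.linalg.lstsq are assumptions made for this illustration, and for brevity the scores are computed on the training data only. Two inputs contribute equally to the output, but one is measured on a scale 1000 times larger, so its coefficient is 1000 times smaller.

```python
import numpy as np

def fit_and_score(X, y):
    """Least-squares fit with an intercept; returns (coefficients, R^2 on the same data)."""
    A = np.hstack([X, np.ones((X.shape[0], 1))])
    w = np.linalg.lstsq(A, y, rcond=None)[0]
    resid = y - A.dot(w)
    r2 = 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)
    return w[:-1], r2

rng = np.random.RandomState(0)
n = 2000
x0 = rng.randn(n)            # unit scale
x1 = 1000.0 * rng.randn(n)   # same role as x0, but on a 1000x larger scale
x2 = rng.randn(n)            # irrelevant input
X = np.column_stack([x0, x1, x2])
y = 1.0 * x0 + 0.001 * x1 + 0.1 * rng.randn(n)

coefs, full_r2 = fit_and_score(X, y)

# Score decrease when each input is removed in turn
drops = []
for i in range(X.shape[1]):
    keep = [j for j in range(X.shape[1]) if j != i]
    _, reduced_r2 = fit_and_score(X[:, keep], y)
    drops.append(full_r2 - reduced_r2)

print("coefficients:", coefs)
print("score decrease per removed input:", drops)
```

Judging by coefficient size alone, x1 would look negligible next to x0, yet removing either one should cost a similar, large share of the score, while removing x2 should cost almost nothing.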



In [86]:

    
# Compute the full-model scores, then the score decrease when each input is removed
def compute_scores(y_label,X_Train,y_Train,X_test,Y_test):
    # Fit on all input columns and report the train and test R^2
    lg = LinearRegression()
    lg.fit(X_Train,y_Train)

    train_score = lg.score(X_Train,y_Train)
    test_score = lg.score(X_test,Y_test)
    print('R-squared(Coeff. of determination): Train:%.3f, Test:%.3f, Ratio:%.3f\n' % (train_score,test_score,(test_score/train_score)))

    n_features = X_Train.shape[1]
    for i in range(n_features):
        # Indices of all input columns except column i
        L = [j for j in range(n_features) if j != i]
        r_train_X=X_Train[:,L]
        r_test_X=X_test[:,L]

        lg = LinearRegression()
        lg.fit(r_train_X,y_Train)
        r_train_score = lg.score(r_train_X,y_Train)
        r_test_score  = lg.score(r_test_X,Y_test)
        # The inputs are the first columns of the global `data`, so data.columns[i] names input i
        print("removed %s Score decrease: \tTrain:%5.3f \tTest: %5.3f " % (data.columns[i], train_score-r_train_score, test_score-r_test_score))

Partition into training set and test set

By dividing the data into two parts, we can detect when our model over-fits. When over-fitting happens, the score on the test set is much smaller than the score on the training set.
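A random partition changes every time it is drawn, and so do the scores computed from it. If reproducible numbers are wanted, one option is to seed the random generator. The sketch below assumes a hypothetical helper random_split_mask and an arbitrary seed of 0; neither is part of this notebook.

```python
import numpy as np

def random_split_mask(n_rows, train_fraction=0.5, seed=0):
    """Boolean mask that is True for rows assigned to the training set."""
    rng = np.random.RandomState(seed)
    return rng.rand(n_rows) < train_fraction

# 12140 is the number of rows in the station table
mask = random_split_mask(12140, train_fraction=0.5, seed=0)
n_train = int(mask.sum())
n_test = int((~mask).sum())
print(n_train, n_test)
```

The mask and its negation can be used to select the training and test rows, and calling the helper again with the same seed gives the same partition.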






In [87]:

    
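from numpy.random import rand
N=data.shape[0]
# Assign each row to the training set with probability 0.5 (no seed, so the split changes between runs)
train_i = rand(N)>0.5
Train = data.loc[train_i,:]
Test  = data.loc[~train_i,:]
print("%s %s %s" % (data.shape,Train.shape,Test.shape))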









    



(12140, 13) (6110, 13) (6030, 13)



In [88]:

    
print(Train.iloc[:,:4].head())









    



   latitude  longitude  elevation  dist_coast
0   36.0042  -119.9600       73.2   107.65500
1   42.7519  -124.5011       12.8     0.61097
2   47.1064  -104.7183      632.8  1316.54000
3   41.7500   -84.2167      247.2   685.50100
6   43.5167  -104.3333     1250.9  1462.50000



In [98]:

    
from matplotlib import pyplot as plt
%matplotlib inline
# Scatter the test points and overlay the fitted line; assumes X_test has a single input column
def plot_regressions(X_test, y_test, clf):
    print(X_test.shape)
    print(y_test.shape)
    plt.scatter(X_test, y_test,  color='black')
    plt.plot(X_test, clf.predict(X_test), color='blue',linewidth=3)



In [102]:

    
# The first 4 columns are the inputs: latitude, longitude, elevation, dist_coast
train_X = Train.iloc[:,:4].values
test_X=Test.iloc[:,:4].values
input_names=list(data.columns[:4])

for target in ["TAVG","TRANGE","SNWD"]:
    for j in range(1,4):
        y_label = target+"_coeff"+str(j)
        train_y = Train[y_label]
        test_y = Test[y_label]
        lg = LinearRegression()
        lg.fit(train_X,train_y)

        print("\nTarget variable:  %s %s" % (y_label, '#'*40))
        print("Coeffs:  " +
            ' '.join(['%s:%5.2f ' % (input_names[i],lg.coef_[i]) for i in range(len(lg.coef_))]))
        
        compute_scores(y_label, train_X, train_y, test_X, test_y)









    



Target variable:  TAVG_coeff1 ########################################
Coeffs:  latitude:-153.41  longitude:-19.60  elevation:-0.69  dist_coast:-0.13 
R-squared(Coeff. of determination): Train:0.932, Test:0.930, Ratio:0.998

removed latitude Score decrease: 	Train:0.608 	Test: 0.618 
removed longitude Score decrease: 	Train:0.071 	Test: 0.062 
removed elevation Score decrease: 	Train:0.132 	Test: 0.116 
removed dist_coast Score decrease: 	Train:0.003 	Test: 0.003 

Target variable:  TAVG_coeff2 ########################################
Coeffs:  latitude:-5.46  longitude: 7.29  elevation:-0.15  dist_coast: 0.48 
R-squared(Coeff. of determination): Train:0.603, Test:0.584, Ratio:0.969

removed latitude Score decrease: 	Train:0.009 	Test: 0.004 
removed longitude Score decrease: 	Train:0.114 	Test: 0.117 
removed elevation Score decrease: 	Train:0.077 	Test: 0.057 
removed dist_coast Score decrease: 	Train:0.391 	Test: 0.380 

Target variable:  TAVG_coeff3 ########################################
Coeffs:  latitude:-4.02  longitude:-2.85  elevation: 0.01  dist_coast: 0.07 
R-squared(Coeff. of determination): Train:0.424, Test:0.392, Ratio:0.925

removed latitude Score decrease: 	Train:0.047 	Test: 0.052 
removed longitude Score decrease: 	Train:0.170 	Test: 0.141 
removed elevation Score decrease: 	Train:0.001 	Test: 0.002 
removed dist_coast Score decrease: 	Train:0.088 	Test: 0.089 

Target variable:  TRANGE_coeff1 ########################################
Coeffs:  latitude:23.62  longitude: 9.07  elevation:-0.34  dist_coast:-0.15 
R-squared(Coeff. of determination): Train:0.474, Test:0.440, Ratio:0.928

removed latitude Score decrease: 	Train:0.052 	Test: 0.055 
removed longitude Score decrease: 	Train:0.055 	Test: 0.044 
removed elevation Score decrease: 	Train:0.120 	Test: 0.120 
removed dist_coast Score decrease: 	Train:0.013 	Test: 0.014 

Target variable:  TRANGE_coeff2 ########################################
Coeffs:  latitude:-32.56  longitude: 6.14  elevation:-0.01  dist_coast: 0.14 
R-squared(Coeff. of determination): Train:0.662, Test:0.629, Ratio:0.950

removed latitude Score decrease: 	Train:0.467 	Test: 0.449 
removed longitude Score decrease: 	Train:0.119 	Test: 0.097 
removed elevation Score decrease: 	Train:0.001 	Test: 0.001 
removed dist_coast Score decrease: 	Train:0.049 	Test: 0.041 

Target variable:  TRANGE_coeff3 ########################################
Coeffs:  latitude: 3.92  longitude: 1.44  elevation: 0.04  dist_coast:-0.04 
R-squared(Coeff. of determination): Train:0.121, Test:0.072, Ratio:0.590

removed latitude Score decrease: 	Train:0.055 	Test: 0.027 
removed longitude Score decrease: 	Train:0.053 	Test: 0.038 
removed elevation Score decrease: 	Train:0.055 	Test: 0.036 
removed dist_coast Score decrease: 	Train:0.029 	Test: 0.016 

Target variable:  SNWD_coeff1 ########################################
Coeffs:  latitude:150.51  longitude:22.40  elevation: 1.15  dist_coast:-0.90 
R-squared(Coeff. of determination): Train:0.242, Test:0.229, Ratio:0.947

removed latitude Score decrease: 	Train:0.155 	Test: 0.154 
removed longitude Score decrease: 	Train:0.025 	Test: 0.024 
removed elevation Score decrease: 	Train:0.098 	Test: 0.090 
removed dist_coast Score decrease: 	Train:0.032 	Test: 0.032 

Target variable:  SNWD_coeff2 ########################################
Coeffs:  latitude: 1.51  longitude:-1.09  elevation:-0.22  dist_coast: 0.24 
R-squared(Coeff. of determination): Train:0.068, Test:0.061, Ratio:0.899

removed latitude Score decrease: 	Train:0.000 	Test: -0.000 
removed longitude Score decrease: 	Train:0.001 	Test: 0.001 
removed elevation Score decrease: 	Train:0.048 	Test: 0.045 
removed dist_coast Score decrease: 	Train:0.032 	Test: 0.027 

Target variable:  SNWD_coeff3 ########################################
Coeffs:  latitude: 8.29  longitude: 0.27  elevation: 0.09  dist_coast: 0.01 
R-squared(Coeff. of determination): Train:0.159, Test:0.113, Ratio:0.713

removed latitude Score decrease: 	Train:0.047 	Test: 0.034 
removed longitude Score decrease: 	Train:0.000 	Test: 0.001 
removed elevation Score decrease: 	Train:0.055 	Test: 0.044 
removed dist_coast Score decrease: 	Train:0.001 	Test: 0.000

Interpretation

When we find an input variable whose removal substantially decreases the score, we want to find a rational explanation for its importance and for the sign of the corresponding coefficient. Please write a one line explanation for each of the following nine input/output pairs (the ones that are numbered). The numbers below come from a different random train/test split than the output above, so they differ slightly.

Target variable:  TAVG_coeff1 ########################################
Coeffs:  latitude:-153.98  longitude:-19.21  elevation:-0.68  dist_coast:-0.13 
R-squared(Coeff. of determination): Train:0.931, Test:0.931

1. removed latitude Score decrease:     Train:0.613     Test: 0.612 
* Removing latitude has by far the largest effect on the accuracy of the prediction of TAVG_coeff1, so latitude is the most important input for this target. The negative coefficient is consistent with the average temperature decreasing as we move north.

2. removed elevation Score decrease:    Train:0.128     Test: 0.121 
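* This component of TAVG depends strongly on elevation: the negative coefficient is consistent with the average temperature decreasing as elevation increases.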

Target variable:  TAVG_coeff2 ########################################
Coeffs:  latitude:-5.33  longitude: 7.46  elevation:-0.14  dist_coast: 0.48 
R-squared(Coeff. of determination): Train:0.603, Test:0.585

3. removed longitude Score decrease:    Train:0.115     Test: 0.116 
4. removed dist_coast Score decrease:   Train:0.393     Test: 0.378 

Target variable:  TAVG_coeff3 ########################################
Coeffs:  latitude:-4.19  longitude:-2.64  elevation: 0.01  dist_coast: 0.07 
R-squared(Coeff. of determination): Train:0.420, Test:0.398

5. removed longitude Score decrease:    Train:0.148     Test: 0.164 
6. removed dist_coast Score decrease:   Train:0.095     Test: 0.082 

Target variable:  TRANGE_coeff1 ########################################
Coeffs:  latitude:25.00  longitude: 8.63  elevation:-0.36  dist_coast:-0.15 
R-squared(Coeff. of determination): Train:0.478, Test:0.435

7. removed elevation Score decrease:    Train:0.127     Test: 0.113 

Target variable:  TRANGE_coeff2 ########################################
Coeffs:  latitude:-32.63  longitude: 6.04  elevation:-0.02  dist_coast: 0.14 
R-squared(Coeff. of determination): Train:0.649, Test:0.642

8. removed latitude Score decrease:     Train:0.461     Test: 0.454 

Target variable:  SNWD_coeff1 ########################################
Coeffs:  latitude:147.72  longitude:21.54  elevation: 1.09  dist_coast:-0.88 
R-squared(Coeff. of determination): Train:0.232, Test:0.238

9. removed latitude Score decrease:     Train:0.153     Test: 0.155

Write your answers here

Consult the plots of the eigen-vectors. Those for SNWD are available in an earlier notebook. The statistics for TRANGE and TAVG are in the file http://mas-dse-open.s3.amazonaws.com/Weather/STAT_TAVG_RANGE.pickle

For each of the following eigen-vectors, give a short verbal description

TAVG_coeff1: Avg. Temp. across the year
TAVG_coeff2: Summer & winter temperature diff.
TAVG_coeff3: Fall & winter temp. diff.
TRANGE_coeff1: Summer & winter avg daily temp range diff
TRANGE_coeff2: Summer & winter temp change diff
SNWD_coeff1: Average snow depth (winter)

Once you have given a meaning to each of these eigen-vectors, explain the relation to the input variable. Short explanations are better than long ones.

1. Average temperature increases as you go south.
2. Average temperature increases as elevation decreases.
3. The difference between summer and winter temperature varies as you go east.
4. The difference between summer and winter temperature varies with distance from the coast.
5. The far east and far west have a larger difference in temperature between fall and winter than the central parts.
6. The temperature variance per day increases with distance from the coast.
7. The difference in average daily temperature range between summer and winter increases as we move lower in elevation.
8. The difference in temperature range between summer and winter increases as we move further south.
9. Locations with high latitude (northern and central parts) get more snow in winter than the western parts.
